Journal of Molecular Evolution
Springer Science and Business Media LLC
All preprints, ranked by how well they match Journal of Molecular Evolution's content profile, based on 21 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Theobald, D.; Sennett, M. A.; Beckett, B. C.
Ancestral sequence resurrection (ASR) is the inference of extinct biological sequences from extant sequences; the most popular methods are based on probabilistic models of evolution. ASR is becoming a popular method for studying the evolution of enzyme characteristics. Ancestral enzymes are characterized biochemically and biophysically to learn about the origin of a given enzyme property. Current methodology relies on resurrection of the single most probable (SMP) sequence and is systematically biased. Previous theoretical work suggests this will bias the thermostability, and even the activity, of resurrected SMP sequences, calling into question inferences derived from ancestral protein properties. We experimentally test the stability bias hypothesis by resurrecting 40 malate and lactate dehydrogenases. Despite the methodological bias in resurrecting an SMP protein, the measured biophysical and biochemical properties of the SMP protein are not biased in comparison to other, less probable, resurrections. In addition, the SMP protein property seems to be representative of the ancestral probability distribution. Therefore, using the SMP protein is likely not a source of bias in the conclusions and inferences drawn from it.
Significance: Ancestral sequence resurrection (ASR) is a powerful tool for determining how new protein functions evolve, inferring the properties of the environment in which species existed, and protein engineering. We demonstrate, using lactate and malate dehydrogenases (L/MDHs), that resurrecting the single most probable (SMP) sequence from a maximum likelihood phylogeny does not result in biased activity and stability relative to sequences sampled from the posterior probability distribution. Previous studies using experimentally measured phenotypes of SMP sequences to make inferences about the environmental conditions and the path of evolution are likely not biased in their conclusions. Serendipitously, we discover ASR is also a valid tool for protein engineering because sampled reconstructions are both highly active and stable.
Beura, P. K.; Aziz, R.; Sen, P.; Das, S.; Namsa, N. D.; Feil, E.; Satapathy, S. S.; Ray, S. K.
Transitions (ti) and transversions (tv) are the major causes of genome variation. Accurate estimation of the ti to tv ratio [Formula] in genomes is crucial for understanding mutational and selection processes in organisms, as the ratio is influenced by both codon degeneracy and pretermination codons (PTC). Therefore, we developed a method (accessible at https://github.com/CBBILAB/CBBI.git) to estimate the [Formula] ratio by accounting for codon degeneracy as well as PTC in protein coding sequences. Our findings revealed a distinct impact of codon degeneracy and PTC on the [Formula] ratio in the Escherichia coli genome. The frequencies of the different base substitutions in the E. coli genome followed a decreasing order: synonymous transition (Sti) > synonymous transversion (Stv) > non-synonymous transition (Nti) > non-synonymous transversion (Ntv). The correlation between Sti and Stv values was strong (Pearson r value 0.795), whereas the correlation between Sti and Nti was weak (Pearson r value 0.192). Coding sequences with similar Sti values exhibited a wide range of Nti values. This indicated varying strength of purifying selection acting on the coding sequences. In concordance with this assumption, genes with higher Nti values had lower codon adaptation index (CAI) values than genes with lower Nti values. Our approach makes it convenient to visualize variation in base substitution frequency as well as selection in protein coding sequences. The proposed method is useful for estimating different [Formula] ratios accurately in coding sequences and is insightful from an evolutionary perspective.
Article Summary: Genetic diversity is pivotal in evolution, with base substitution as a key driver. Transition (ti) frequency surpasses transversion (tv) frequency in genomes, making [Formula] ratios a valuable metric for studying mutation bias. Our improved estimator for [Formula] calculation accounts for codon degeneracy and nonsense substitutions in pretermination codons. Additionally, we unveil insights into the frequency of different substitutions such as Sti, Stv, Nti, and Ntv and demonstrate the impact of selection on protein coding sequences.
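For readers unfamiliar with the ti/tv terminology in the abstract above, the following is a minimal sketch of how transitions and transversions can be counted between two aligned sequences. It is not the authors' estimator, which additionally accounts for codon degeneracy and pretermination codons; the function name `count_ti_tv` and the example sequences are illustrative assumptions.

```python
# Illustrative sketch only; not the method from the linked repository.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ti_tv(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Count transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ti = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # skip identical sites, gaps and ambiguous bases
        # A transition stays within one chemical class (purine or pyrimidine).
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return ti, tv


# Hypothetical example pair: one G->A transition, two transversions.
ti, tv = count_ti_tv("ATGGCCAAA", "ATGACCTAC")
print(ti, tv)
```

A naive ratio would then be `ti / tv` (guarding against `tv == 0`); the abstract's point is that such a raw ratio is confounded by codon degeneracy and PTC.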
Coray, D. S.; Sibaeva, N.; McGimpsey, S.; Gardner, P. P.
The reactions of functional molecules like proteins and RNAs to mutation affect both host cell viability and biomolecular evolution. These molecules are considered robust if function is maintained despite mutations. Proteins and RNAs have different structural and functional characteristics that affect their robustness, and to date, comparisons between them have been theoretical. In this work, we test the relative mutational robustness of RNA and protein pairs using three approaches: evolutionary, structural, and functional. We compare the nucleotide diversities of functional RNAs with those of matched proteins. Across different levels of conservation, we found the nucleotide-level variations between the biomolecules largely overlapped, with proteins generally supporting more variation than matched RNAs. We then directly tested the robustness of the protein and RNA pairs with in vitro and in silico mutagenesis of their respective genes. The in silico experiments showed that proteins and RNAs reacted similarly to point mutations and insertions or deletions, yet proteins are slightly more robust on average than RNAs. In vitro, mutated fluorescent RNAs retained greater levels of function than the proteins. Overall this suggests that proteins and RNAs have remarkably similar degrees of robustness, with the average protein having moderately higher robustness than RNA as a group.
Significance Statement: The ability of proteins and non-coding RNAs to maintain function despite mutations in their respective genes is known as mutational robustness. Robustness impacts how molecules maintain and change phenotypes, which has a bearing on evolution and the origin of life as well as influencing modern biotechnology. Both protein and RNA have mechanisms that allow them to absorb DNA-level changes. Proteins have a redundant genetic code and non-coding RNAs can maintain structure and function through flexible base-pairing possibilities. The few theoretical treatments comparing protein and RNA robustness differ in their conclusions. In this experimental comparison of protein and RNA, we find that they have remarkably similar degrees of overall genetic robustness.
Moshensky, D.; Alexeevski, A.
The origin and evolution of genes that have common base pairs (overlapping genes) are of particular interest because such genes influence each other. Especially intriguing are gene pairs with long overlaps. In prokaryotes, co-directional overlaps longer than 60 bp were shown to be nonexistent except for some instances. A few antiparallel prokaryotic genes with long overlaps were described in the literature. We have analyzed putative long antiparallel overlapping genes to determine whether open reading frames (ORFs) located opposite to genes (antiparallel ORFs) can be protein-coding genes.
We have confirmed that long antiparallel ORFs (AORFs) are reliably more frequent than expected. There are 10 472 000 AORFs with overlap length of more than 180 bp in the 929 analyzed genomes. Stop codons on the strand opposite the coding strand are avoided in 2 898 cases with a Benjamini-Hochberg threshold of 0.01.
Using Ka/Ks ratio calculations, we have revealed that long AORFs do not affect the type of selection acting on genes in the vast majority of cases. This observation indicates that translations of long AORFs are commonly not under negative selection.
An illustrative example is the 282 AORFs longer than 1 800 bp found opposite to extremely conserved dnaK genes. Translations of these AORFs were annotated "glutamate dehydrogenases" and were included in the Pfam database as the third protein family of glutamate dehydrogenases, PF10712. Ka/Ks analysis has demonstrated that if these translations correspond to proteins, they are not subject to negative selection, while dnaK genes are under strong stabilizing selection. Moreover, we have found other arguments against the hypothesis that these AORFs encode essential proteins, that is, proteins indispensable for cellular machinery.
However, some AORFs, in particular those related to dnaK, have been found to slightly resist synonymous changes in the genes. This indicates the possibility of their translation. We speculate that translations of certain AORFs might have a functional role other than encoding essential proteins.
Essential genes are unlikely to be encoded by AORFs in prokaryotic genomes. Nevertheless, some AORFs might have biological significance associated with their translations.
Author summary: Genes that have common base pairs are called overlapping genes. We have examined the most intriguing case: whether gene pairs encoded on opposite DNA strands exist in prokaryotes. An intersection length threshold of 180 bp has been used. A few such pairs of genes were experimentally confirmed.
We have detected all long antiparallel ORFs in 929 prokaryotic genomes and have found that the number of open reading frames located opposite to annotated genes is much higher than expected according to a statistical model. We have developed a measure of stop codon avoidance on the opposite strand. The lengths of the antiparallel ORFs found with stop codon avoidance are typical for prokaryotic genes.
Comparative genomics analysis shows that long antiparallel ORFs (AORFs) are unlikely to be essential protein-coding genes. We have analyzed the distributions of features typical for essential proteins among formal translations of all long AORFs: prevalence of negative selection, non-uniformity of the distribution of conserved positions in a multiple alignment of homologous proteins, and the character of the distribution of homologs in the phylogenetic tree of prokaryotes. None of them has been observed for the majority of long AORFs. Notably, the same results have been obtained for some experimentally confirmed antiparallel overlapping genes (AOGs).
Thus, pairs of antiparallel overlapping essential genes are unlikely to exist. On the other hand, some antiparallel ORFs affect the evolution of the genes opposite which they are located. Consequently, translations of some antiparallel ORFs might have yet unknown biological significance.
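To make the notion of an open reading frame "opposite" a gene concrete, here is a rough sketch that reverse-complements a gene sequence and reports the longest stop-free run of codons across the three frames of the opposite strand. This is an illustrative simplification and not the authors' detection procedure or their stop codon avoidance measure; the function names, the use of the standard stop codons TAA/TAG/TGA, and the toy sequence are assumptions.

```python
# Illustrative sketch only; the abstract does not specify its ORF-calling rules.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_stop_free_run(gene: str) -> int:
    """Length in nt of the longest stop-free codon run on the opposite strand."""
    opposite = reverse_complement(gene)
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(opposite) - 2, 3):
            if opposite[i:i + 3] in STOP_CODONS:
                run = 0  # a stop codon terminates the current run
            else:
                run += 3
                best = max(best, run)
    return best


# Toy example: the opposite strand of this sequence reads ATGAAATAA.
print(longest_stop_free_run("TTATTTCAT"))
```

Under this simplification, a run would be compared against a length threshold such as the 180 bp used in the abstract; a real analysis would also need start codons and gene coordinates.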
Elofsson, A.
It is well known that the GC content varies enormously between organisms; this is believed to be caused by a combination of mutational preferences and selective pressure. Within coding regions, the variation of GC is more substantial in position three and smaller in positions one and two. Less well known is that this variation also has an enormous impact on the frequency of amino acids, as their codons vary in GC content. For instance, the fraction of alanines in different proteomes varies from 1.1% to 16.5%. In general, the frequency of different amino acids correlates strongly with the number of codons, the GC content of these codons and the genomic GC content. However, there are clear and systematic deviations from the expected frequencies. Some amino acids are more frequent than expected by chance, while others are less frequent. A plausible model to explain this is that two different selective forces act on the genes: first, a force acting to maintain the overall GC level, and second, a selective force acting on the amino acid level. Here, we use the divergence in amino acid frequency from what is expected by the GC content to analyze the selective pressure acting on codon frequencies in the three kingdoms of life. We find four major selective forces. First, the frequency of serine is lower than expected in all genomes, but most in prokaryotes. Secondly, there exists a selective pressure acting to balance positively and negatively charged amino acids, which results in a reduction of arginine and all the negatively charged amino acids. Thirdly, the frequency of the hydrophobic residues encoded by a T in the second codon position does not change with GC. Their frequency is lower in eukaryotes than in prokaryotes. Finally, some amino acids with unique properties, such as glycine and proline, are limited in their frequency variation.
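The per-position GC variation mentioned in this abstract can be illustrated with a short sketch that computes the GC fraction at codon positions one, two and three of a coding sequence. This is only an assumed illustration of the quantity being discussed, not the author's analysis pipeline; the function name and the toy sequence are hypothetical.

```python
# Illustrative sketch only; not taken from the preprint.
def gc_by_codon_position(cds: str) -> list[float]:
    """Return the GC fraction at codon positions 1, 2 and 3 of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one complete codon")
    gc_counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc_counts[i % 3] += 1
    return [count / n_codons for count in gc_counts]


# Toy sequence with codons ATG, GCC, AAA.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCAAA")
print(gc1, gc2, gc3)
```

Computed over all coding sequences of a genome, the third-position value is the one the abstract describes as varying most between organisms.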
Goodheart, J. A.; Rio, R. A.; Taraporevala, N. F.; Fiorenza, R. A.; Barnes, S. R.; Morrill, K.; Jacob, M. A. C.; Whitesel, C.; Masterson, P.; Batzel, G. O.; Johnston, H. T.; Ramirez, M. D.; Katz, P. S.; Lyons, D.
How novel phenotypes originate from conserved genes, processes, and tissues remains a major question in biology. Research that sets out to answer this question often focuses on the conserved genes and processes involved, an approach that explicitly excludes the impact of genetic elements that may be classified as clade-specific, even though many of these genes are known to be important for many novel, or clade-restricted, phenotypes. This is especially true for understudied phyla such as mollusks, where limited genomic and functional biology resources for members of this phylum has long hindered assessments of genetic homology and function. To address this gap, we constructed a chromosome-level genome for the gastropod Berghia stephanieae (Valdes, 2005) to investigate the expression of clade-specific genes across both novel and conserved tissue types in this species. The final assembled and filtered Berghia genome is comparable to other high quality mollusk genomes in terms of size (1.05 Gb) and number of predicted genes (24,960 genes), and is highly contiguous. The proportion of upregulated, clade-specific genes varied across tissues, but with no clear trend between the proportion of clade-specific genes and the novelty of the tissue. However, more complex tissue like the brain had the highest total number of upregulated, clade-specific genes, though the ratio of upregulated clade-specific genes to the total number of upregulated genes was low. Our results, when combined with previous research on the impact of novel genes on phenotypic evolution, highlight the fact that the complexity of the novel tissue or behavior, the type of novelty, and the developmental timing of evolutionary modifications will all influence how novel and conserved genes interact to generate diversity.
Chen, Z.-R.
Background: The genome sizes of organisms vary widely (C-value paradox). There are non-transcribing regions in the genome that encode neither proteins nor RNA entities. There are several hypotheses about the function of these regions: one suggests that they are unannotated functional areas, while another views them as genomic isolation zones that reduce mutations in coding regions.
Method: Statistical analysis was conducted on the transcribing regions (including areas annotated as genes and transcribed pseudogenes) and non-transcribing regions, protein-coding regions (coding sequence, CDS), and genome sizes using annotation files from 63,866 species genomes in the NCBI RefSeq database.
Results: There is a significant linear relationship between the size of non-transcribing genomic regions and overall genome size across species, with varying proportional coefficients among different phyla (realms for viruses). As genome size increases, the proportion of non-transcribing regions gradually rises, eventually approaching a linear proportional limit, resembling one arm of a hyperbolic function. Eukaryotes show high linear correlation, with the highest in Streptophyta and the lowest in Apicomplexa. In eukaryotes, the size of the coding region increases with genome size, but the increasing trend diminishes (proportionally decreases). In non-eukaryotes, the size of the coding region maintains a linear relationship with genome size.
Conclusion: The size of the non-transcribing region in species may be subject to some strict quantitative control mechanism, showing that genome and non-transcribing genome sizes increase proportionally with the expansion of the transcribing genome, indicating a strict balance between expansion and energy conservation. The proportion of non-transcribed genomes in eukaryotes is conservative (although the sequences are not), and the presence of non-transcribing genomes has significant implications for the evolution or survival of species. Thus, I propose a new hypothesis about the non-transcribing genome: that it is a space for generating new genes from scratch, and that the different proportional coefficients among phyla are due to their different positions in energy transfer.
[Graphic abstract: Figure 1]
Warwick Vesztrocy, A.; Glover, N.; Thomas, P. D.; Dessimoz, C.; Julca, I.
Gene duplication is a major evolutionary source of functional innovation. Following duplication events, gene copies (paralogues) may undergo various fates, including retention with functional modifications (such as sub-functionalisation or neo-functionalisation) or loss. When paralogues are retained, this results in complex orthology relationships, including one-to-many or many-to-many. In such cases, determining which one-to-one pair is more likely to have conserved functions can be challenging. It has been proposed that, following gene duplication, the copy that diverges more slowly in sequence is more likely to maintain the ancestral function, referred to here as "the least diverged orthologue (LDO) conjecture". This study explores this conjecture, using a novel method to identify asymmetric evolution of paralogues and applying it to all gene families across the Tree of Life in the PANTHER database. Structural data for over 1 million proteins and expression data for 16 animals and 20 plants were then used to investigate functional divergence following duplication. This analysis, the most comprehensive to date, revealed that whilst the majority of paralogues display similar rates of sequence evolution, significant differences in branch lengths following gene duplication can be correlated with functional divergence. Overall, the results support the least diverged orthologue conjecture, suggesting that the least diverged orthologue (LDO) tends to retain the ancestral function, whilst the most diverged orthologue (MDO) may acquire a new, potentially specialised, role.
Lynch, V. J.
There is a longstanding interest in whether the loss of complex characters is reversible (so-called "Dollo's law"). Reevolution has been suggested for numerous traits, but among the first to do so was Kurten (1963), who proposed that the presence of the second lower molar (M2) of the Eurasian lynx (Lynx lynx) was a violation of Dollo's law because all other felids lack M2. While this is an early and often cited example of the reevolution of a complex trait, Kurten (1963) and Werdelin (1987) used an ad hoc parsimony argument to support their proposition that M2 reevolved in Eurasian lynx. Here I revisit the evidence that M2 reevolved in Eurasian lynx using explicit parsimony and maximum likelihood models of character evolution and find strong evidence that Kurten (1963) and Werdelin (1987) were correct: M2 reevolved in Eurasian lynx. Next, I explore the developmental mechanisms which may explain this violation of Dollo's law and suggest that the reevolution of lost complex traits may arise from the reevolution of cis-regulatory elements and protein-protein interactions, which have a longer half-life after silencing than protein coding genes. Finally, I present a developmental model to explain the reevolution of M2 in Eurasian lynx.
Menezes, A. P. A.; Almeida, J. V. d. A.; Del-Bem, L.-E.; Lobo, F. P.
The abundance of plant genomic information caused by the decrease of sequencing costs contrasts with the lack of databases that integrate genome annotation, taxonomy and phenotypes to produce statistically sound, biologically meaningful knowledge. Here we present ARCADE (ARChaeplastida Annotation DatabasE), a database of 171 high-quality archaeplastidian non-redundant proteomes gathered from six primary genomic databases, together with proteome quality metrics and a growing number of associated metadata. As a case study to demonstrate the usefulness of ARCADE, we used it to investigate the expansion and contraction of protein domains associated with the evolution of genome size (hereafter GS). GS varies greatly among land plants and the synthesis of large genomes can be costly to cells. Although GS has been studied extensively for decades, the molecular mechanisms involved in the adaptations of plants to the increase in GS are still poorly understood. We used the annotation and phylogenetic information available in ARCADE, together with estimated GS values available for 83 land plant species, to search for associations between the abundance of protein domain families in these species and GS variation through phylogenetic-aware methods. Additionally, we estimated the GS for the ancestral nodes of the extant land plant species. GS seems to be decreasing along the course of evolution, except for a few branches that might have undergone independent GS increases. We found 7 Pfam domain families correlated with the variation in GS in land plants, mainly related to nucleotide metabolism, DNA repair and genome organization. We found larger genomes to have a greater frequency of the Histone 2A superfamily, responsible for diverse functions, including nucleosome formation and silencing of transposable elements. The molecular functions we found correlated with GS variation may be associated with preserving genome stability in larger genomes, and might indicate the evolution of mechanisms to cope with the variation in GS in land plants. ARCADE is available at https://bit.ly/ARCADE_OSF.
Poliseno, A.; Quattrini, A. M.; Lau, Y. W.; Pirro, S.; Reimer, J. D.; McFadden, C. S.
The complete mitochondrial genomes of octocorals typically range from 18.5 kb to 20.5 kb in length, and include 14 protein coding genes (PCGs), two ribosomal RNA genes and one tRNA. To date seven different gene orders (A-G) have been described, yet comprehensive investigations of the actual number of arrangements, as well as comparative analyses and evolutionary reconstructions of mitochondrial genome evolution within the whole subclass Octocorallia have been often overlooked. Here we considered the complete mitochondrial genomes available for octocorals and explored their structure and gene order variability. Our results updated the actual number of mitochondrial gene order arrangements so far known for octocorals from seven to twelve, and allowed us to explore and preliminarily discuss the role of some of the structural and functional factors in the mitogenomes. We performed comparative mitogenomic analyses on the existing and novel octocoral gene orders, considering different mitogenomic structural features such as genome size, GC percentage, AT- and GC-skewness. The mitochondrial gene order history mapped on a recently published nuclear loci phylogeny showed that the most common rearrangement events in octocorals are inversions, and that the mitochondrial genome evolution in the subclass is discontinuous, with rearranged gene orders restricted only to some regions of the tree. We believe that different rearrangement events arose independently and most likely that new gene orders, instead of being derived from other rearranged orders, came from the ancestral and most common gene order. Finally, our data demonstrate how the study of mitochondrial gene orders can be used to explore the evolution of octocorals and in some cases can be used to assess the phylogenetic placement of certain taxa.
Ng, W.
Genome architecture concerns the organisation of genes on a chromosome, and has important implications for the fidelity with which genes are encoded on the chromosome, and for how the information is read by DNA polymerase and RNA polymerase. This facet of genomics received attention in the early epoch of genomics, but it has received less attention in contemporary genomics as attention shifts to structural and functional genomics with the goal of annotating the function of each gene in the genome. This work sought to uncover relationships between number of genes and chromosome length in a variety of bacterial and archaeal species as a preamble to understanding the prevalence and importance of repetitive sequences in the genomes of prokaryotic species. Aggregate results for the ensemble of prokaryotic species profiled revealed a positive linear correlation between number of genes and chromosome length. Upon dissection into the Bacteria and Archaea domains, the linear relationship described above still stands for Bacteria but starts to break down in Archaea. This suggests that repetitive sequences are more important to archaeal species, which generally have a smaller genome (1.8 to 2.8 Mbp) and fewer genes (1500 to 2400) compared to bacterial species. In comparison, the bacterial genome is larger (4 to 5.6 Mbp), and encodes more genes (3600 to 5100). Overall, the results highlight that bacterial genomes are efficiently encoded with few repetitive sequences. This, however, is not true for archaeal genomes, which provides another line of evidence supporting the notion that archaea are ancestral to eukaryotic cells, which like the archaea also house large repetitive sequences.
[Graphical abstract: Figure 1]
Short description: Statistical analysis across an ensemble of 59 microbial species revealed a strong linear correlation between number of genes and chromosome length. This suggests that prokaryotic genomes are highly compact with genes, and do not carry significant amounts of repeats, unlike the case in eukaryotic organisms. The result holds significant implications for our understanding of genome evolution and compaction in prokaryotic organisms, and what drove their accession as foundational species of many ecosystems.
Subject areas: genomics, molecular biology, evolutionary biology, bioinformatics, systems biology
Dilucca, M.; Forcelloni, S.; Cimini, G.; Giansanti, A.
In this work, we study the correlation between codon usage and the network features of the protein-protein interaction (PPI) network in bacterial genomes. We extend the results of Dilucca et al. (2015) on the E. coli genome to a set of 34 other bacteria. We use PCA techniques in the space of codon bias indices (compAI, compAI_w, tAI, NC) and GC content to show that genes with similar patterns of codon usage have a significantly higher probability that their encoded proteins interact within the PPI. Conversely, we show that proteins interacting in the PPI have a coherent codon usage. This work could allow for future investigations into the possible effects that codon bias signal can have on the topology of the protein interaction network and, as such, help improve existing bioinformatics methods for predicting protein interactions.
Aylward, F. O.; Martinez-Gutierrez, C. A.
The evolutionary forces that determine genome size in bacteria and archaea have been the subject of intense debate over the last few decades. Although the preferential loss of genes observed in prokaryotes is explained through the deletional bias, factors promoting and preventing the fixation of such gene losses remain unclear. Moreover, statistical analyses on this topic have typically been limited to a narrow diversity of bacteria and archaea without considering the potential bias introduced by the shared recent ancestry of many lineages. In this study, we used a phylogenetic generalized least-squares (PGLS) analysis to evaluate the effect of different factors on the genome size of a broad diversity of bacteria and archaea. We used dN/dS to estimate the strength of purifying selection, and 16S copy number as a proxy for ecological strategy, which have both been postulated to play a role in shaping genome size. After model fit, Pagel's lambda indicated a strong phylogenetic signal in genome size, suggesting that the diversification of this trait is strongly influenced by shared evolutionary histories. As a predictor variable, dN/dS showed poor predictability and non-significance when phylogeny was considered, consistent with the view that genome reduction can occur under either weak or strong purifying selection depending on the ecological context. Copies of 16S rRNA showed poor predictability but maintained significance when accounting for non-independence in residuals, suggesting that ecological strategy as approximated from 16S rRNA copies might play a minor role in genome size variation. Altogether, our results indicate that genome size is a complex trait that is not driven by any singular underlying evolutionary force, but rather depends on lineage- and niche-specific factors that will vary widely across bacteria and archaea.
Author Summary: The evolutionary forces driving genome size in bacteria and archaea have been subject to debate during the last decades. Independent comparative analyses have suggested that unique variables, such as the strength of selection, environmental complexity, and mutation rate, are the main drivers of this trait, which complicates generalizations across the Tree of Life. Here, we applied a phylogeny-based statistical approach to assess how tightly genome size is linked to evolutionary history in bacteria and archaea. Moreover, we also evaluated the predictability of genome size from the strength of purifying selection and ecological strategy on a broad diversity of bacterial and archaeal genomes. Our approach indicates that genome size in prokaryotes is strongly dependent on phylogenetic history, and that genome size is the result of the interaction of variables like past events, current selection regimes, and environmental complexity that are clade dependent.
Gil-Gomez, A. M.; Rest, J. S.
As species diverge, a wide range of evolutionary processes lead to changes in protein-protein interaction networks and metabolic networks. The rate at which biological networks evolve is an important question in evolutionary biology. Previous empirical work has focused on interactomes from model organisms to calculate rewiring rates, but this is limited by the relatively small number of species and sparse nature of network data across species. We present a proxy for variation in network topology: variation in drug-drug interactions (DDIs), obtained by studying drug combinations (DCs) across taxa. Here, we propose the rate at which DDIs change across species as an estimate of the rate at which the underlying biological network changes as species diverge. We computed the evolutionary rates of DDIs using previously published data from a high throughput study in gram-negative bacteria. Using phylogenetic comparative methods, we found that DDIs diverge rapidly over short evolutionary time periods, but that divergence saturates over longer time periods. In parallel, we mapped drugs with known targets in protein-protein interaction and co-functional networks. We found that the targets of synergistic DDIs are closer in these networks than other types of DCs and that synergistic interactions have a higher evolutionary rate, meaning that nodes that are closer evolve at a faster rate. Future studies of network evolution may use DC data to gain larger-scale perspectives on the details of network evolution within and between species.
Leung, A.; Chang, B. S. W.; Sage, R. F.
Enzymes are thought to be tuned to perform similarly in different thermal regimes. Whether the photosynthetic enzyme ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) follows similar rules, especially when considering evolutionary history, is uncertain. The molecular, structural, and ecological factors of the rubisco large subunit (RbcL) were examined in four plant clades: wood ferns, pines, sea lavenders, and viburnums. Codon evolutionary models were applied to rbcL gene sequences to test for positive and divergent selection and convergent evolution. Protein structure modeling was performed to predict side chain changes and protein stability. Phylogenetic comparative methods were used to examine the relationship between protein stability and growing season temperature. All four clades showed significant evidence of positive selection, with multiple convergent substitutions predicted to alter side chain polarity and interactions with the solvent. In viburnums, biome transition rates were dependent on amino acid substitution, with positive selection concentrated in cold temperate and cloud forest clades. Rubiscos with higher stability occurred in species from warmer environments. However, this correlation was weaker after correcting for phylogeny. These analyses support a hypothesis that RbcL evolution is influenced by both environmental tuning and evolutionary history.
Fullmer, M. S.; Puente-Lelievre, C.; Matzke, N. J.
The recent introduction of Foldseek's 3Di character alphabet to encode 3D protein structure has opened up new possibilities for structural phylogenetics. These characters, like protein structure, are more conserved than amino acids, raising the possibility of better resolution of very deep branches on the tree of life. As 3Di characters have a 20-letter alphabet, they are readily treatable with off-the-shelf algorithms for model-based phylogenetic inference and related methods such as bootstrapping. However, it remains to be seen if 3Di phylogenies are broadly more resolved than sequence-based phylogenies. We present data using samples from nine protein superfamilies showing that 3Di combines with sequence to produce better resolved phylogenies than either sequence or 3Di alone. We also show that information-theoretic measures, applied to superfamily alignments, significantly correlate with resolution in phylogenies derived from these alignments. Further, we identify the proportion of alpha helices in proteins as a major driver in reducing the information carried by 3Di character alignments, explaining the relatively poor performance of 3Di characters on superfamilies with highly conserved structure but high alpha-helical content. Our results provide encouragement for the further use of 3Di to address challenging questions in deep history, but also sound a note of caution about which proteins it is most suitable for. Significance: 3Di characters have been suggested as a method to generate well-resolved deep phylogenies. Our results show that 3Di characters combined with sequences can improve resolution in the deepest nodes of protein superfamily trees. However, our results also show that 3Di characters may not be suitable for all protein types.
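The abstract above does not say which information-theoretic measures were applied to the alignments. As an illustration only, the sketch below computes per-column Shannon entropy, one common measure of this kind, for an alignment over any 20-letter alphabet (amino acids or 3Di). The function names, the gap character, and the choice of entropy itself are assumptions and are not taken from the preprint.

```python
from __future__ import annotations

import math
from collections import Counter


def column_entropy(column: str, gap: str = "-") -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gap characters.

    Hypothetical helper: the preprint does not specify its measure.
    """
    counts = Counter(ch for ch in column if ch != gap)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # all-gap column carries no information
    # 0.0 - sum(...) avoids returning -0.0 for fully conserved columns
    return 0.0 - sum((k / n) * math.log2(k / n) for k in counts.values())


def alignment_entropies(rows: list[str]) -> list[float]:
    """Entropy of every column of an alignment given as equal-length rows."""
    return [column_entropy("".join(col)) for col in zip(*rows)]
```

A fully conserved column scores 0 bits, and a column split evenly between two characters scores 1 bit, so lower mean entropy indicates a more conserved alignment.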
Pandey, A.; Braun, E. L.
Motivation: Protein sequence evolution is a complex process that varies among sites within proteins and across the tree of life. Comparisons of evolutionary rate matrices for specific taxa (clade-specific models) have the potential to reveal this variation and provide information about the underlying reasons for those changes. To study changes in patterns of protein sequence evolution we estimated and compared clade-specific models in a way that acknowledged variation within proteins due to structure. Results: Clade-specific model fit was able to correctly classify proteins from four specific groups (vertebrates, plants, oomycetes, and yeasts) more than 70% of the time. This was true whether we used mixture models that incorporate relative solvent accessibility or simple models that treat sites as homogeneous. Thus, protein evolution is non-homogeneous over the tree of life. However, a small number of dimensions could explain the differences among models (for mixture models ~50% of the variance reflected relative solvent accessibility and ~25% reflected clade). Relaxed purifying selection in taxa with lower long-term effective population sizes appears to explain much of the among-clade variance. Relaxed selection on solvent-exposed sites was correlated with changes in amino acid side-chain volume; other differences among models were more complex. Beyond the information they reveal about protein evolution, our clade-specific models also represent tools for phylogenomic inference. Availability: Model files are available from https://github.com/ebraun68/clade_specific_prot_models. Contact: ebraun68@ufl.edu. Supplementary information: Supplementary data are appended to this preprint.
Zhang, D.; Zou, H.; Zhang, J.; Wang, G.; Jakovlic, I.
Inversions of the origin of replication (ORI) of mitochondrial genomes produce asymmetrical mutational pressures that can cause artefactual clustering in phylogenetic analyses. It is therefore an absolute prerequisite for all molecular evolution studies that use mitochondrial data to account for ORI events in the evolutionary history of their dataset. The number of ORI events in crustaceans remains unknown; several studies reported ORI events in some crustacean lineages on the basis of fully inversed (e.g. negative vs. positive) GC skew patterns, but studies of isolated lineages could have easily overlooked ORI events that produced merely a reduction in the skew magnitude. In this study, we used a comprehensive taxonomic approach to systematically study the evolutionary history of ORI events in crustaceans using all available mitogenomes and combining signals from lineage-specific skew magnitude and direction (+ or -), cumulative skew diagrams, and gene rearrangements. We inferred 24 putative ORI events (14 of which have not been proposed before): 17 with relative confidence, and 7 speculative. Most of these were located at lower taxonomic levels, but there are indications of ORIs that occurred at or above the order-level: Copepoda, Isopoda, and putatively in Branchiopoda and Poecilostomatida+Cyclopoida. Several putative ORI events did not result in fully inversed skews. In many lineages skew plots were not informative for the prediction of replication origin and direction of mutational pressures, but inversions of the mitogenome fragment comprising the ancestral CR (rrnS-CR-trnI) were rather good predictors of skew inversions. 
We also found that skew plots can be a useful tool to indirectly infer the relative strengths of mutational and purifying pressures in some crustacean lineages: when purifying pressures outweigh mutational pressures, GC skew plots are strongly affected by the strand distribution of genes, whereas when mutational pressures outweigh purifying pressures, GC skew plots can appear completely unaffected by the strand distribution of genes. This observation has very important repercussions for phylogenetic and evolutionary studies, as it implies that not only the relatively rare ORI events, but also the much more common gene strand switches and same-strand rearrangements can produce mutational bursts, which in turn affect phylogenetic and evolutionary analyses. We argue that such compositional biases may produce misleading signals not only in phylogenetic but also in other types of evolutionary analyses (dN/dS ratios, codon usage bias, base composition, branch length comparison, etc.), and discuss several such examples. Therefore, all studies of mtDNA sequence evolution should pay close attention to architectural rearrangements.
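For readers unfamiliar with the skew statistics mentioned above, GC skew is conventionally defined as (G - C) / (G + C), and a cumulative skew diagram plots the running sum of window-wise skew along the sequence. The following is a minimal sketch of both. The function names and the window size are illustrative assumptions, and the code is not the pipeline used in the preprint.

```python
from __future__ import annotations


def gc_skew(seq: str) -> float:
    """GC skew (G - C) / (G + C) of a sequence; 0.0 if it contains no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_gc_skew(seq: str, window: int = 100) -> list[float]:
    """Running sum of GC skew over consecutive non-overlapping windows.

    The window size is an arbitrary illustrative default.
    """
    total = 0.0
    out = []
    for start in range(0, len(seq), window):
        total += gc_skew(seq[start:start + window])
        out.append(total)
    return out
```

A sign flip in `gc_skew` between lineages corresponds to a fully inverted skew, while a change in the slope of the cumulative curve marks a position where the strand bias changes along the genome.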
Gu, X.
Fluctuating selection among individuals (FSI) refers to any mutation that exhibits different fitness effects among individuals. Thus, the selective nature of a mutation (deleterious, neutral, or beneficial) should be interpreted in terms of the population average. For instance, a mutation that is neutral on average could be slightly deleterious in some individuals and slightly beneficial in others. It has recently been demonstrated that the effect of FSI is important in molecular evolution, especially when the effective population size (Ne) is not small. Intriguingly, a novel pattern of molecular evolution called selection duality, in which mutations that are statistically slightly beneficial are subject to negative selection, emerges when the selective advantage is less than FSI. While FSI sheds some light on the long-running neutralist-selectionist debate, an immediate question is how to calculate the strength of FSI-genetic drift relative to Ne-genetic drift. In this article we develop a statistical method to estimate the relative FSI strength (F): if F is close to 0, the Ne-genetic drift is dominant, whereas the FSI-genetic drift is dominant if F is close to 1. One may tentatively set F=0.5 as an empirical criterion to weigh between those two genetic drifts. Our case study showed that the relative FSI strength F is over 0.5 in most species, suggesting that the FSI-genetic drift, rather than the Ne-genetic drift, plays a major role in metazoan genome evolution.